CPU Accelerated Fine-Tuning for Image Segmentation using PyTorch¶

Abstract: Fine-tuning neural networks has historically been quite slow and cumbersome on CPUs. However, with mixed precision BF16 training and the Intel® Extension for PyTorch*, fine-tuning is feasible on a CPU and perhaps even preferred where cost and availability are key factors. In this tutorial, I will walk you through a real-world example of training an AI image segmentation model using PyTorch* 1.13.1 (with ResNet34 + UNet architecture); the model will learn to identify roads and speed limits from satellite images, all on the newly released 4th Gen Intel® Xeon® Scalable Processor (Sapphire Rapids).


Author: Ben Consolvo | Company: Intel | Date: March 4, 2023


Please join me on Intel's Developer Discord to have further discussion following the talk. My user is silvos#5002
Invite link: https://discord.gg/rv2Gp55UJQ




Introduction

I am excited to show you today that CPUs are a viable option for fine-tuning a deep learning model.

In this tutorial, you will learn how to accelerate a PyTorch training job with Sapphire Rapids. We will use the Intel Extension for PyTorch (IPEX) library to enable the built-in AI acceleration on the Sapphire Rapids CPU with minimal code changes. You will work through a pixel-segmentation example: training a model on satellite images with matching street labels.

The potential cost savings of renting a CPU from one of the major cloud service providers (CSPs), instead of a GPU, are significant. That said, the latest CPUs are still being rolled out to the CSPs, and there is typically a lag between launch date and availability. The Sapphire Rapids CPU I will show you today is a beast! It is hosted on the Intel® Developer Cloud*; you can sign up for the beta at cloud.intel.com.

Coming soon, I will provide a new tutorial around using PyTorch 2.0 after its release on March 14, 2023.

Below are the results of the tests I ran to understand how Sapphire Rapids performs on this fine-tuning workload.


Key References

Much of my material and code was taken from the CRESI repository below. I've adapted it for use on Sapphire Rapids, with optimizations from Intel Extension for PyTorch.

  • https://github.com/avanetten/cresi

In particular, I was able to piece together a workflow using the notebooks here:

  • https://github.com/avanetten/cresi/tree/main/notebooks

I also highly recommend these Medium articles for another detailed explanation of how to get started with the SpaceNet5 data:

  • The SpaceNet 5 Baseline — Part 1: Imagery and Label Preparation
  • The SpaceNet 5 Baseline — Part 2: Training a Road Speed Segmentation Model
  • The SpaceNet 5 Baseline — Part 3: Extracting Road Speed Vectors from Satellite Imagery
  • SpaceNet 5 Winning Model Release: End of the Road

I also referenced two Hugging Face blogs by Julien Simon, who ran his tests on the AWS* r7iz.metal-16xl instance:

  • Accelerating PyTorch Transformers with Intel Sapphire Rapids, part 1
  • Accelerating PyTorch Transformers with Intel Sapphire Rapids, part 2

4th Gen Intel® Xeon® Scalable Processor (Sapphire Rapids) - CPU specs

  • 2 sockets
  • 56 physical cores (112 vCPUs) per socket
  • 224 vCPUs total
  • 504 GB of memory
  • New Intel® Advanced Matrix Extensions (Intel® AMX), a built-in AI acceleration engine.

Operating System:

  • Ubuntu* 22.04

Key Software:

  • Intel Extension for PyTorch (GitHub link | Product page)

Model Architecture:

  • ResNet34 + UNet (see winning model from SpaceNet 5 Winning Model Release: End of the Road).
In [1]:
!lscpu 
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  224
  On-line CPU(s) list:   0-223
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8480+
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  56
    Socket(s):           2
    Stepping:            8
    CPU max MHz:         3800.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
                         a cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss 
                         ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art
                          arch_perfmon pebs bts rep_good nopl xtopology nonstop_
                         tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes6
                         4 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xt
                         pr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_
                         deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3d
                         nowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpci
                         d_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibr
                         s_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad
                          fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid c
                         qm rdt_a avx512f avx512dq rdseed adx smap avx512ifma cl
                         flushopt clwb intel_pt avx512cd sha_ni avx512bw avx512v
                         l xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc 
                         cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni 
                         avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_ac
                         t_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke 
                         waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni a
                         vx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_d
                         etect cldemote movdiri movdir64b enqcmd fsrm md_clear s
                         erialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16
                          amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   5.3 MiB (112 instances)
  L1i:                   3.5 MiB (112 instances)
  L2:                    224 MiB (112 instances)
  L3:                    210 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-55,112-167
  NUMA node1 CPU(s):     56-111,168-223
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
                          and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer
                          sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB fillin
                         g, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Flags for AMX¶

Under "flags", you can see amx_bf16, amx_tile, and amx_int8, so you know AMX is available for use.
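If you want to check for these flags programmatically, a small helper like the one below (my own sketch, not part of the cresi codebase) parses a whitespace-separated flag list, such as the `flags` line from /proc/cpuinfo or the `Flags` field of lscpu:

```python
def has_amx(flags: str) -> bool:
    """Return True if all three AMX instruction-set flags are present.

    `flags` is a whitespace-separated flag list, e.g. the `flags` line
    from /proc/cpuinfo or the `Flags` field of `lscpu`.
    """
    required = {"amx_bf16", "amx_tile", "amx_int8"}
    return required.issubset(flags.split())


if __name__ == "__main__":
    # On Linux, read the real flag list; elsewhere this file won't exist.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    print("AMX available:", has_amx(line.split(":", 1)[1]))
                    break
    except FileNotFoundError:
        print("/proc/cpuinfo not found; pass the lscpu flags string to has_amx()")
```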

Environment setup¶

The environment setup will depend on what already comes installed on the machine you are using, but here is what I did to get my environment set up properly. Please note that you may need to make adjustments depending on the build and configuration of your machine. I have left these steps as command-line instructions, rather than notebook cells, as the text output gets too long for a notebook.

  1. Registration on Intel Developer Cloud

To get access to the latest Intel computing hardware:

  • Sign up for the beta here: cloud.intel.com
  • Head over to the Get Started page for instructions on setting up an instance
  • I used the "4th Generation Intel® Xeon® Scalable processors" instance.


  2. Building backend software with apt-get

Installing apt-get packages

#Fix broken installs (if any) for apt-get
sudo apt-get --fix-broken install

#Update apt-get
sudo apt-get update
sudo apt-get autoremove

#Install packages
sudo apt-get install -y \
    cdo \
    nco \
    gdal-bin \
    libgdal-dev \
    libjemalloc-dev \
    awscli \
    cmake \
    apt-utils \
    python3-dev \
    libssl-dev \
    libffi-dev \
    libncurses-dev \
    libgl1 \
    ffmpeg \
    libsm6 \
    libxext6 \
    numactl

  3. Setting up Anaconda environment

Building Anaconda environment and installing some packages with conda install.

#Download and install Miniconda distribution of Anaconda
wget https://repo.anaconda.com/miniconda/Miniconda3-py38_23.1.0-1-Linux-x86_64.sh -O ~/miniconda.sh && \
    sudo /bin/bash ~/miniconda.sh -b -p /opt/conda && \
    rm ~/miniconda.sh && \
    sudo /opt/conda/bin/conda clean -tip && \
    sudo ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    source ~/.bashrc

#Create new conda environment
conda create --name py39 python=3.9
echo "conda activate py39" >> ~/.bashrc
source ~/.bashrc

#install conda packages
conda install -c conda-forge libgdal
conda install tiledb=2.2
conda install poppler
conda install intel-openmp
conda install gperftools -c conda-forge

  4. Installing pip packages

Installing PyTorch for CPU, as well as the packages listed in a requirements.txt file

#pip upgrades
python -m pip install --upgrade pip wheel
pip3 install setuptools==57.5.0 #for 3rd gen xeon

#install torch for CPU
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu

#install requirements packages
pip3 install -r requirements.txt

#put the newly created conda environment into Jupyter*
python -m ipykernel install --user --name=py39

#install oneCCL if want to launch more complicated (distributed) jobs
#python -m pip install oneccl_bind_pt==1.13 -f https://developer.intel.com/ipex-whl-stable-cpu

The requirements.txt is:

scikit-image
jupyterlab
ipykernel
numpy
pandas
scipy
matplotlib
fiona
opencv-python
shapely
imagecodecs
tqdm
osmnx
torchsummary
geopandas==0.6.3
tensorboardX
tensorboard
networkx==2.8
numba
utm
intel_extension_for_pytorch
gdal==3.0.4
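After installing, you can sanity-check that the key packages resolve without actually importing anything heavy. This is my own snippet, not part of the tutorial's required steps; note that gdal imports as osgeo and opencv-python as cv2:

```python
import importlib.util


def check_packages(names):
    """Map each top-level package name to True/False depending on
    whether it can be found on the current Python path."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    status = check_packages(["torch", "intel_extension_for_pytorch", "osgeo", "cv2"])
    for name, ok in status.items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```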

  5. Cloning the cresi repository

The cresi repository contains a lot of the code needed to run training and inference in this demo. Navigate to the folder where you want to store the cresi repo and go ahead and clone it with:

git clone https://github.com/avanetten/cresi

  6. Downloading SpaceNet 5 data from AWS S3 bucket

For this tutorial we will use public imagery from the SpaceNet 5 Challenge. These weights and images are part of the Registry of Open Data on AWS and can be downloaded for free. You will need an AWS account to access the data, and the AWS CLI tool installed. Then, simply execute aws configure from the command line and input your AWS Access Key ID and AWS Secret Access Key.
Once the AWS CLI is set up, you should be able to download the dataset directly from the public S3 bucket below. There are both train and test datasets; the train datasets contain labels, while the test datasets do not. The commands below list all of the available relevant data, zipped up into .tar.gz files.

In [2]:
!aws s3 ls s3://spacenet-dataset/spacenet/SN5_roads/tarballs/ --human-readable
2019-09-03 20:59:32    5.8 GiB SN5_roads_test_public_AOI_7_Moscow.tar.gz
2019-09-24 08:43:02    3.2 GiB SN5_roads_test_public_AOI_8_Mumbai.tar.gz
2019-09-24 08:43:47    4.9 GiB SN5_roads_test_public_AOI_9_San_Juan.tar.gz
2019-09-14 13:13:26   35.0 GiB SN5_roads_train_AOI_7_Moscow.tar.gz
2019-09-14 13:13:34   18.5 GiB SN5_roads_train_AOI_8_Mumbai.tar.gz
In [3]:
!aws s3 ls s3://spacenet-dataset/spacenet/SN3_roads/tarballs/ --human-readable
2019-08-23 12:09:19  728.8 MiB SN3_roads_sample.tar.gz
2019-08-23 11:34:23    8.1 GiB SN3_roads_test_public_AOI_2_Vegas.tar.gz
2019-08-23 11:35:30    1.8 GiB SN3_roads_test_public_AOI_3_Paris.tar.gz
2019-08-23 11:30:48    8.0 GiB SN3_roads_test_public_AOI_4_Shanghai.tar.gz
2019-08-23 11:31:55    1.6 GiB SN3_roads_test_public_AOI_5_Khartoum.tar.gz
2019-08-21 10:27:25   24.3 GiB SN3_roads_train_AOI_2_Vegas.tar.gz
2019-09-03 12:00:38    1.4 MiB SN3_roads_train_AOI_2_Vegas_geojson_roads_speed.tar.gz
2019-08-21 10:27:25    5.5 GiB SN3_roads_train_AOI_3_Paris.tar.gz
2019-09-03 12:01:25  234.7 KiB SN3_roads_train_AOI_3_Paris_geojson_roads_speed.tar.gz
2019-08-21 10:27:25   24.0 GiB SN3_roads_train_AOI_4_Shanghai.tar.gz
2019-09-03 12:01:47    1.6 MiB SN3_roads_train_AOI_4_Shanghai_geojson_roads_speed.tar.gz
2019-08-21 10:27:25    4.9 GiB SN3_roads_train_AOI_5_Khartoum.tar.gz
2019-09-03 12:02:16  486.4 KiB SN3_roads_train_AOI_5_Khartoum_geojson_roads_speed.tar.gz

As you can see above, some of these files are multiple GBs, so check your available disk space before downloading.

An example of downloading a specific file and unzipping

aws s3 cp s3://spacenet-dataset/spacenet/SN5_roads/tarballs/SN5_roads_train_AOI_7_Moscow.tar.gz .
tar -xvzf ~/spacenet5data/moscow/SN5_roads_train_AOI_7_Moscow.tar.gz

An example of downloading a whole folder from S3

!aws s3 cp s3://spacenet-dataset/spacenet/SN3_roads/tarballs/ . --recursive

  7. Launching JupyterLab*

Here are instructions for launching JupyterLab on the Intel Developer Cloud instance, and for connecting to it from a web browser on your local machine. After you have installed JupyterLab in the previous steps, you can launch a Jupyter server from the command line. I prefer to do this in a tmux window, so that I can detach and still use the same terminal window. You can read more about tmux on their GitHub. You can use the provided command:

jupyter lab --port=8080 --ServerApp.ip=* --no-browser

In order to be able to connect to the Jupyter server on a local browser, you can create an SSH tunnel with a command like the following in a different terminal window (connecting your local machine to the remote machine):

ssh -J guest@146.152.226.42 -L 8080:localhost:8080 devcloud@192.168.19.2

You should now be able to go to a local browser and enter the URL below to access your Jupyter server:
http://localhost:8080

In [4]:
#Command to show how much room you have available on disk
!df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            51G  4.0M   51G   1% /run
/dev/nvme0n1p2  3.5T  415G  2.9T  13% /
tmpfs           252G  160K  252G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
/dev/nvme0n1p1  537M  5.3M  532M   1% /boot/efi
tmpfs            51G  4.0K   51G   1% /run/user/1000

Python* Package Imports for Notebook¶

In [5]:
#importing packages
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import skimage.io
import json
import pandas as pd

%load_ext autoreload
%autoreload 2

Image Dataset Preparation - Making Masks¶

The images have names like the following: SN5_roads_train_AOI_7_Moscow_PS-RGB_chip180.tif

  • SN5 means SpaceNet 5
  • PS-MS means pan-sharpened (PS) 8-band multispectral (MS) image
  • PS-RGB means pan-sharpened (PS) 3-band RGB image

The 8-band multispectral images are of shape (1300,1300,8), whereas the RGB images are of shape (1300,1300,3). Here in the notebook I only show the 3-channel RGB images, but feel free to explore ways to visualize the 8-band images.
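For example, one way to preview an 8-band PS-MS image is to pull out three of its bands and rescale them for display. This is my own illustration; the band indices chosen below are an assumption, so check the SpaceNet documentation for the actual band layout:

```python
import numpy as np


def bands_to_rgb(img: np.ndarray, band_idx=(4, 2, 1)) -> np.ndarray:
    """Select three bands from an (H, W, C) multispectral image and
    min-max rescale each to 0-255 uint8 for display.

    band_idx is illustrative; consult the dataset docs for real band order.
    """
    rgb = img[:, :, list(band_idx)].astype(np.float64)
    for c in range(3):
        band = rgb[:, :, c]
        lo, hi = band.min(), band.max()
        rgb[:, :, c] = 0 if hi == lo else (band - lo) / (hi - lo) * 255
    return rgb.astype(np.uint8)


# Example with a random stand-in for a (1300, 1300, 8) PS-MS chip
fake_chip = np.random.rand(64, 64, 8)
rgb = bands_to_rgb(fake_chip)
print(rgb.shape, rgb.dtype)  # (64, 64, 3) uint8
```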

Each "chip" of a satellite image has corresponding labels of where the streets are located. We will use these labels to teach a neural network, so that it can predict where streets are from the satellite image alone.

After unzipping the SN5_roads_train_AOI_7_Moscow.tar.gz file, you should find the geojson_roads_speed folder within each area of interest (AOI) directory (nfs/data/cosmiq/spacenet/competitions/SN5_roads/tiles_upload/train/AOI_7_Moscow/geojson_roads_speed). It contains road centerline labels along with estimates of safe travel speeds for each roadway. We'll use these centerline labels and speed estimates to create training masks. We assume a mask buffer of 2 meters, meaning that each roadway is assigned a total width of 4 meters. Remember that the goal of our segmentation step is to detect road centerlines, so while this is not the precise width of the road, a buffer of 2 meters is an appropriate width for our segmentation model.

One option for training a segmentation model is to create training masks where the value of the mask is proportional to the speed of the roadway. This can be accomplished by running the speed_masks.py script (https://github.com/avanetten/cresi/blob/main/cresi/data_prep/speed_masks.py).

python3 /home/devcloud/cresi/cresi/data_prep/speed_masks.py --geojson_dir=/home/devcloud/spacenet5data/moscow/data/geojson_roads_speed \
        --image_dir=/home/devcloud/spacenet5data/moscow/data/PS-MS \
        --output_conversion_csv_binned=/home/devcloud/spacenet5data/moscow/data/v6/output_conversion_csv_binned/sn5_roads_train_speed_conversion_binned.csv \
        --output_mask_dir=/home/devcloud/spacenet5data/moscow/data/v6/train_mask_binned \
        --output_mask_multidim_dir=/home/devcloud/spacenet5data/moscow/data/v6/train_mask_binned_mc \
        --buffer_distance_meters=2 \
        --crs=None
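As a quick sanity check on the 2-meter buffer: at the 0.3 m ground sample distance (GSD) set in the training config, each buffered road mask ends up roughly 13 pixels wide. A tiny helper of my own for the arithmetic:

```python
def buffer_width_pixels(buffer_m: float, gsd_m_per_px: float) -> float:
    """Total mask width in pixels for a road centerline buffered by
    buffer_m on each side, at the given ground sample distance."""
    return 2 * buffer_m / gsd_m_per_px


# 2 m buffer on each side, 0.3 m per pixel -> about 13.3 pixels wide
print(buffer_width_pixels(2.0, 0.3))
```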

Displaying Satellite Images and Corresponding Image Masks
Below, I am showing a sample of some of the satellite images in RGB format (each image is an array of shape (1300,1300,3), where 1300x1300 is the pixel count and 3 is the number of R-G-B color channels).

The generated mask images are the corresponding masks to these specific clips. The different colors of the roads indicate different speed limits.

In [6]:
#Showing some sample images
image_dir = '/home/devcloud/spacenet5data/moscow/train_data/PS-RGB' #replace with your local path to the images
image_names = ['SN5_roads_train_AOI_7_Moscow_PS-RGB_chip180.tif',
               'SN5_roads_train_AOI_7_Moscow_PS-RGB_chip181.tif',
               'SN5_roads_train_AOI_7_Moscow_PS-RGB_chip182.tif',
               'SN5_roads_train_AOI_7_Moscow_PS-RGB_chip183.tif']
full_path_images = [Path(image_dir,img) for img in image_names]

fig = plt.figure(figsize=(30, 10))
columns = 4
rows = 1
ax = []
for idx,num in enumerate(range(columns*rows)):
    img = skimage.io.imread(full_path_images[idx])
    ax.append(fig.add_subplot(rows,columns,num+1))
    ax[-1].set_title(image_names[idx])
    plt.imshow(img)
plt.show()
print(f'Image array shape is {img.shape}')
Image array shape is (1300, 1300, 3)
In [7]:
#Showing some sample images
image_dir = '/home/devcloud/spacenet5data/moscow/v10/train_mask_binned/' #replace with your local path to the images
image_names = ['SN5_roads_train_AOI_7_Moscow_PS-MS_chip180.tif',
               'SN5_roads_train_AOI_7_Moscow_PS-MS_chip181.tif',
               'SN5_roads_train_AOI_7_Moscow_PS-MS_chip182.tif',
               'SN5_roads_train_AOI_7_Moscow_PS-MS_chip183.tif']
full_path_images = [Path(image_dir,img) for img in image_names]

fig = plt.figure(figsize=(30, 10))
columns = 4
rows = 1
ax = []
for idx,num in enumerate(range(columns*rows)):
    img = skimage.io.imread(full_path_images[idx])
    ax.append(fig.add_subplot(rows,columns,num+1))
    ax[-1].set_title(image_names[idx])
    plt.imshow(img)
plt.show()
print(f'Image array shape is {img.shape}')
Image array shape is (1300, 1300)

Training Configuration Prep¶

Building Configuration JSON for train/validation split and training parameters
First, we need to build a JSON configuration file that defines how we will split the image data into training and validation sets, and that sets our training parameters. A default configuration file ships in your cloned cresi directory. You can find a sample config here:
https://github.com/avanetten/cresi/blob/main/cresi/configs/sn5_baseline_aws.json.

My edited config file looks like:

{
    "path_src": "/home/devcloud/cresi/cresi",
    "path_results_root": "/home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04", 
    "train_data_refined_dir_ims": "/home/devcloud/spacenet5data/moscow/train_data/PS-MS",
    "train_data_refined_dir_masks": "/home/devcloud/spacenet5data/moscow/v10/train_mask_binned_mc",
    "speed_conversion_file": "/home/devcloud/spacenet5data/moscow/v10/output_conversion_csv_binned/sn5_roads_train_speed_conversion_binned.csv",
    "folds_file_name": "folds4.csv",
    "save_weights_dir": "sn5_baseline",
    "num_folds": 1,
    "default_val_perc": 0.2,
    "num_channels": 8,
    "num_classes": 8,
    "network": "resnet34",
    "loss": {
        "soft_dice": 0.25,
        "focal": 0.75
    },
    "early_stopper_patience": 8,
    "nb_epoch": 30,
    "test_data_refined_dir": "/home/devcloud/spacenet5data/moscow/test_data/PS-MS",
    "test_results_dir": "sn5_baseline",
    "folds_save_dir": "folds",
    "tile_df_csv": "tile_df.csv",
    "test_sliced_dir": "",
    "slice_x": 0,
    "slice_y": 0,
    "stride_x": 0,
    "stride_y": 0,
    "skeleton_band": 7,
    "skeleton_thresh": 0.3,
    "min_subgraph_length_pix": 20,
    "min_spur_length_m": 10,
    "GSD": 0.3,
    "rdp_epsilon": 1,
    "log_to_console": 1,
    "intersection_band": -1,
    "use_medial_axis": 0,
    "merged_dir": "merged",
    "stitched_dir_raw": "stitched/mask_raw",
    "stitched_dir_count": "stitched/mask_count",
    "stitched_dir_norm": "stitched/mask_norm",
    "wkt_submission": "wkt_submission_nospeed.csv",
    "skeleton_dir": "skeleton",
    "skeleton_pkl_dir": "sknw_gpickle",
    "graph_dir": "graphs",
    "padding": 22,
    "eval_rows": 1344,
    "eval_cols": 1344,
    "batch_size": 64,
    "iter_size": 1,
    "lr": 0.0001,
    "lr_steps": [
        20,
        25
    ],
    "lr_gamma": 0.2,
    "test_pad": 64,
    "epoch_size": 8,
    "predict_batch_size": 33,
    "target_cols": 512,
    "target_rows": 512,
    "optimizer": "adam",
    "warmup": 0,
    "ignore_target_size": false
}

Code to load JSON into Jupyter:

training_config = '/home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json'
with open(training_config) as f:
    obj = json.load(f)
print(json.dumps(obj, indent=4))

Here are a few elaborations on some of these parameters:

parameter description
path_src directory to local clone of cresi repository folder
path_results_root directory where all results are saved, including the split of training and validation data, and trained model weights.
train_data_refined_dir_ims directory where all of the training images reside. For example, SN5_roads_train_AOI_7_Moscow_PS-MS_chip1241.tif is one of the images in this directory.
train_data_refined_dir_masks directory where the mask images reside. These images are generated with the speed_masks.py script referenced earlier. The images should have the exact same name as the training images.
speed_conversion_file path to the binned speed conversion CSV file. It has 3 columns, burn_val,speed,channel, with the first line looking like 36,1,0. Speeds are in miles per hour; there are 7 output channels for the speed bins, and the 8th channel is an aggregate of the rest.
folds_file_name This file will be saved in the path_results_root / weights / save_weights_dir directory. It is simply a list of the image file names along with each file's fold number, e.g. SN5_roads_train_AOI_7_Moscow_PS-MS_chip379.tif,0. In our case, if the fold number is 0, the image is part of the validation set.
save_weights_dir Directory where model weights are saved in path_results_root / weights path.
num_folds Number of validation fold datasets. "Using multiple cross-validated models can improve performance, though at the expense of increased run time at inference. For our baseline we use only a single fold, and randomly withhold 20% of the data for validation purposes during training. We use this validation data to test performance after each epoch, and truncate training if the validation loss does not decrease for 8 epochs." (Source)
default_val_perc Percentage of data for validation during training.
num_channels The number of channels in the training image. For PS-MS pan-sharpened (PS) 8-band multispectral (MS) images, you can put 8. If using PS-RGB pan-sharpened (PS) 3-band RGB images, you can put 3.
num_classes Number of bins of speed limits. In our case, we have binned the data into 8 speed limit buckets.
network Neural network architecture. Choices at the time of writing are: resnet34, resnet50, resnet101,seresnet50,seresnet101,seresnet152, seresnext50, or seresnext101. The choice of model architecture will influence the speed and accuracy of training.
loss Type of loss function. "Custom loss function comprised of 25% Dice Loss and 75% Focal Loss." (Source)
nb_epoch Number of epochs
batch_size Number of images to use in one batch to load into memory. Note that this can be increased or decreased depending on your memory requirements.
optimizer Optimizers at the time of writing that can be chosen are: adam, rmsprop, and sgd.

There are numerous other configuration parameters that can be adjusted for training time and inference time, but we will leave those alone.

Creating Validation Fold
We need to create the validation fold of data. To do that, we can use the 00_gen_folds.py script (https://github.com/avanetten/cresi/blob/main/cresi/00_gen_folds.py) with the previously configured parameter file. For example, I ran:

!python3 /home/devcloud/cresi/cresi/00_gen_folds.py /home/devcloud/cresi/cresi/configs/ben/v5_sn5_baseline_ben.json

The output is a folds CSV file: /home/devcloud/spacenet5data/moscow/results_xeon4_devcloud/weights/sn5_baseline/folds4.csv

There are folds 0, 1, 2, 3, and 4. The validation data are fold 0 (20%), and the training data are folds 1-4 (80%).
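Conceptually, the fold assignment boils down to shuffling the image names and dealing them into five groups. Here is a minimal sketch of my own, assuming a simple seeded shuffle with round-robin assignment (the actual 00_gen_folds.py script may differ in detail):

```python
import random


def gen_folds(image_names, n_groups=5, seed=42):
    """Shuffle image names and deal them round-robin into n_groups folds.
    Fold 0 then serves as the validation set (~1/n_groups of the data)."""
    names = list(image_names)
    random.Random(seed).shuffle(names)
    return [(name, i % n_groups) for i, name in enumerate(names)]


rows = gen_folds([f"chip{i}.tif" for i in range(100)])
val = [name for name, fold in rows if fold == 0]
print(f"{len(val)} of {len(rows)} images in validation fold 0")  # 20 of 100
```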

In [8]:
folds = '/home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/weights/sn5_baseline/folds4.csv'
pd.read_csv(folds)
Out[8]:
Unnamed: 0 fold
0 SN5_roads_train_AOI_7_Moscow_PS-MS_chip842.tif 0
1 SN5_roads_train_AOI_7_Moscow_PS-MS_chip1327.tif 1
2 SN5_roads_train_AOI_7_Moscow_PS-MS_chip561.tif 2
3 SN5_roads_train_AOI_7_Moscow_PS-MS_chip91.tif 3
4 SN5_roads_train_AOI_7_Moscow_PS-MS_chip1263.tif 4
... ... ...
1348 SN5_roads_train_AOI_7_Moscow_PS-MS_chip843.tif 3
1349 SN5_roads_train_AOI_7_Moscow_PS-MS_chip932.tif 4
1350 SN5_roads_train_AOI_7_Moscow_PS-MS_chip1032.tif 0
1351 SN5_roads_train_AOI_7_Moscow_PS-MS_chip1168.tif 1
1352 SN5_roads_train_AOI_7_Moscow_PS-MS_chip761.tif 2

1353 rows × 2 columns

Codebase Changes¶

I had to make some minor code changes to the cresi repo, described below, in order to run with the new Intel CPU with AMX.


  1. Make sure that the code runs natively on a CPU, instead of a GPU.
    • In https://github.com/avanetten/cresi/blob/main/cresi/net/pytorch_utils/train.py:
      • Replace self.model = nn.DataParallel(model).cuda() with self.model = nn.DataParallel(model)
    • In https://github.com/avanetten/cresi/blob/main/cresi/01_train.py:
      • Remove torch.randn(10).cuda()

  2. Optimize the training code with Intel Extension for PyTorch, to get the most benefit out of training on a CPU.

    • In https://github.com/avanetten/cresi/blob/main/cresi/net/pytorch_utils/train.py:

      • Add import intel_extension_for_pytorch as ipex with the import statements

      • Add line for IPEX optimization and bfloat16 for mixed precision training just after defining the model and optimizer:

        self.model = nn.DataParallel(model)  # .cuda() removed; CPU only
        self.optimizer = optimizer(self.model.parameters(), lr=config.lr)

        self.model, self.optimizer = ipex.optimize(self.model, optimizer=self.optimizer, dtype=torch.bfloat16)

      • Add a line to do mixed precision on CPU just before running a forward pass and calculating the loss function:

        with torch.cpu.amp.autocast():
            if verbose:
                print("input.shape, target.shape:", input.shape, target.shape)
            output = self.model(input)
            meter = self.calculate_loss_single_channel(output, target, meter, training, iter_size)

  3. Optimize inference code with Intel Extension for PyTorch
    • In https://github.com/avanetten/cresi/blob/main/cresi/net/pytorch_utils/eval.py:
      • Add import intel_extension_for_pytorch as ipex in the imports
      • After loading the PyTorch model, use IPEX to optimize the model for bfloat16 inference
        model = torch.load(os.path.join(path_model_weights, 'fold{}_best.pth'.format(fold)),  map_location=lambda storage, loc: storage)
        model.eval()
        
        model = ipex.optimize(model, dtype=torch.bfloat16)
      • Just prior to running prediction on the data, add 2 lines to do mixed precision on CPU:
        with torch.no_grad():
            with torch.cpu.amp.autocast():
                for data in pbar:
                    samples = torch.autograd.Variable(data['image'], volatile=True)
                    predicted = predict(model, samples, flips=self.flips)
        

  4. Replace outdated torch code.
    • In https://github.com/avanetten/cresi/blob/main/cresi/net/pytorch_utils/loss.py:
      • Replace all of the outdated .view functions with .reshape.
    • In https://github.com/avanetten/cresi/blob/main/cresi/net/pytorch_utils/train.py:
      • Replace all F.sigmoid (or torch.nn.functional.sigmoid) with torch.sigmoid
      • Replace all .clip_grad_norm with .clip_grad_norm_
      • Replace self.estimator.lr_scheduler.step(epoch) with self.estimator.lr_scheduler.step()
      • Remove the following lines entirely, as autograd is no longer needed:
        • input = torch.autograd.Variable(input.cuda(async=True), volatile=not training)
        • target = torch.autograd.Variable(target.cuda(async=True), volatile=not training)

Begin Training ResNet34 + UNet¶

Let's recap. We have:

  • Created speed-limit image masks to correspond to our training satellite image data
  • Defined our configuration file for training
  • Split up our training and validation data
  • Optimized our code for CPU

I am training the entire ResNet34 + UNet model, with no layers frozen during training. We can now begin our training run with the 01_train.py script (original here: https://github.com/avanetten/cresi/blob/main/cresi/01_train.py).

ipexrun --ninstances 1 --ncore_per_instance 32 /home/devcloud/cresi/cresi/01_train.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json --fold=0

The ipexrun utility is a simplified front end for the numactl command-line tool, used to pin a workload to specific hardware resources. In my case, I am asking my training run to:

  • Keep all of the work on 1 socket (--ninstances 1)
  • Keep all of the work on 32 physical cores (--ncore_per_instance 32)

I found that training performed better when kept on a single socket, because of the overhead of communicating across sockets. Going up to the maximum of 56 physical cores did improve the epoch time slightly, from 15.75 minutes to 14.5 minutes, but keeping the training on 1 socket rather than spreading it across 2 sockets is what mattered most.
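To see which logical CPUs belong to each socket, you can expand the NUMA ranges reported by lscpu (e.g. 0-55,112-167 for node 0 above). A small helper of my own:

```python
def parse_cpu_list(spec: str) -> list:
    """Expand an lscpu-style CPU list like '0-55,112-167' into CPU ids."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


node0 = parse_cpu_list("0-55,112-167")
# ids 0-55 are the physical cores on socket 0; 112-167 are their SMT siblings
print(len(node0), "logical CPUs on NUMA node 0")  # 112 logical CPUs on NUMA node 0
```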

During Training Run
During training, you can launch TensorBoard with the appropriate log directory to monitor the loss function progress over time:

tensorboard --logdir /home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/logs/sn5_baseline/fold0 --port 8090

And then you can create another SSH tunnel:

ssh -J guest@146.152.226.42 -L 8090:localhost:8090 devcloud@192.168.19.2

and open http://localhost:8090 in your local browser to view the TensorBoard dashboard with the loss curves.

Training Run Results

  • Each training epoch took only around 15 minutes to run.
  • The total training over 30 epochs took around 7.25 hours to run.
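These numbers are consistent: 30 epochs at roughly 14.5 minutes each comes out to 7.25 hours. As a trivial check:

```python
epochs = 30
minutes_per_epoch = 14.5  # reported as "around 15 minutes" per epoch above

total_hours = epochs * minutes_per_epoch / 60
print(f"{total_hours:.2f} hours")  # 7.25 hours
```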

Inference¶

To run inference, we can use the 02_eval.py script (https://github.com/avanetten/cresi/blob/main/cresi/02_eval.py). Remember that we modified a few lines to accommodate AMX and BF16 (see Codebase Changes above).

python3 /home/devcloud/cresi/cresi/02_eval.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
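The BF16 changes referenced above follow this general pattern in PyTorch 1.13: wrap the forward pass in a CPU autocast context so supported ops run in bfloat16 (which AMX accelerates on 4th Gen Xeon). This is a minimal sketch with a stand-in model, not the actual cresi code; the real script also applies the Intel Extension for PyTorch optimizations.

```python
import torch

# Stand-in model (the real script uses the ResNet34 + UNet network).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
)
model.eval()

x = torch.randn(1, 3, 64, 64)

# CPU autocast runs supported ops (e.g. convolutions) in bfloat16.
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # outputs of autocast-eligible ops come back as bfloat16
```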

Here is a sample mask output that was predicted.

In [9]:
img_path = '/home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/sn5_baseline/folds/fold0_SN5_roads_test_public_AOI_7_Moscow_PS-MS_chip95.tif'
# inspect
mask_pred = skimage.io.imread(img_path)
print("mask_pred.shape:", mask_pred.shape)

# plot all layers
fig, axes = plt.subplots(2, 4, figsize=(16, 9))
for i, ax in enumerate(axes.flatten()):
    if i < (len(axes.flatten()) - 1):
        title = 'Mask Channel {}'.format(str(i))
    else:
        title = 'Aggregate' 
    ax.imshow(mask_pred[i,:,:])
    ax.set_title(title)
mask_pred.shape: (8, 1300, 1300)

Here is the original image and the mask aggregate prediction.

In [10]:
full_path_image = '/home/devcloud/spacenet5data/moscow/test_data/PS-RGB/SN5_roads_test_public_AOI_7_Moscow_PS-RGB_chip95.tif'
full_path_mask = '/home/devcloud/spacenet5data/moscow/v10_xeon4_devcloud22.04/sn5_baseline/folds/fold0_SN5_roads_test_public_AOI_7_Moscow_PS-MS_chip95.tif'

img1 = skimage.io.imread(full_path_image)
img2 = skimage.io.imread(full_path_mask)[7,:,:]

f, axarr = plt.subplots(1,2,figsize=(16, 9))
axarr[0].imshow(img1)
axarr[0].set_title('Satellite Image chip 95')
axarr[1].imshow(img2)
axarr[1].set_title('Prediction Mask chip 95')
Out[10]:
Text(0.5, 1.0, 'Prediction Mask chip 95')

Model Limitations¶

I realize that the model I've trained is overfit to the Moscow image data and likely will not generalize well to other cities. However, the winning solution to this challenge used data from 6 cities (Las Vegas, Paris, Shanghai, Khartoum, Moscow, Mumbai) and performed well on a new city.

In the future, one thing that would be worth testing is training on all 6 cities and running inference on another city to reproduce their results.

Post-Processing¶

Merge, stitch, skeletonize
There are further post-processing steps that can be performed to add the mask as graph features to maps. You can read more about the post-processing steps here:

  • The SpaceNet 5 Baseline — Part 3: Extracting Road Speed Vectors from Satellite Imagery

And the post-processing scripts can be found here:

  • https://github.com/avanetten/cresi/tree/main/cresi

The text outputs are quite verbose, so I recommend running these in a separate terminal (ideally with tmux) rather than in a notebook.

python3 /home/devcloud/cresi/cresi/03a_merge_preds.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
python3 /home/devcloud/cresi/cresi/03b_stitch.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
python3 /home/devcloud/cresi/cresi/04_skeletonize.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
python3 /home/devcloud/cresi/cresi/05_wkt_to_G.py /home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
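Since all four scripts take the same configuration file, a small loop keeps this tidy. Shown here as a dry run that only prints each command; remove the echo to actually execute the pipeline (the paths are the ones used in this tutorial; adjust for your setup).

```shell
# Dry run: print the post-processing pipeline commands in order.
# Drop the "echo" to actually run each step.
CONFIG=/home/devcloud/cresi/cresi/configs/ben/v10_xeon4_baseline_ben.json
for SCRIPT in 03a_merge_preds 03b_stitch 04_skeletonize 05_wkt_to_G; do
  echo python3 /home/devcloud/cresi/cresi/${SCRIPT}.py "$CONFIG"
done
```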

Recap¶

Summary of accomplishments:

  • Created 1352 image training masks (with speed limits) to correspond to our training satellite image data (from .geojson text file labels)
  • Defined our configuration file for training and inference
  • Split up our training and validation data (80/20)
  • Optimized our code for CPU training, including using Intel Extension for PyTorch and BF16
  • Trained a performant ResNet34 + UNet model on a Sapphire Rapids CPU in ~7.25 hours for 30 epochs
  • Ran initial inference to see the prediction of a speed limit mask

This fine-tuning of a ResNet34 + UNet image segmentation model led me to the following conclusions:

  1. Deep learning training on CPUs is now feasible when using Sapphire Rapids
  2. Training was 3.1X faster per epoch on Sapphire Rapids (15.75 minutes) than on the previous 3rd Gen Intel® Xeon® Scalable Processor (Ice Lake) (48.92 minutes) with the same code and configuration.
  3. Training time per epoch was comparable to an NVIDIA* Tesla T4 GPU (13.13 minutes). Note that for the GPU run, none of the CPU-specific optimizations applied; the unmodified GPU training code at https://github.com/avanetten/cresi was used.
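The 3.1X figure falls directly out of the per-epoch times quoted above:

```python
# Per-epoch training times (minutes) from the runs described above.
ice_lake = 48.92         # 3rd Gen Xeon (Ice Lake)
sapphire_rapids = 15.75  # 4th Gen Xeon (Sapphire Rapids)
t4_gpu = 13.13           # NVIDIA Tesla T4, unmodified GPU code path

print(f"Sapphire Rapids speedup over Ice Lake: {ice_lake / sapphire_rapids:.1f}X")
print(f"Sapphire Rapids vs T4 epoch-time ratio: {sapphire_rapids / t4_gpu:.2f}X")
```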

You can find much more detailed benchmarks here for Sapphire Rapids, including statements concerning the NVIDIA A100 GPU: https://edc.intel.com/content/www/us/en/products/performance/benchmarks/4th-generation-intel-xeon-scalable-processors/

As a data scientist, I care more about the fact that I can complete a training run overnight (or during a workday) than about specific performance benchmarks. The new Sapphire Rapids CPU makes fine-tuning on a CPU a practical reality, which is exciting. Happy coding!


Learn more:¶

Start by using the Intel Extension for PyTorch:

pip install intel-extension-for-pytorch

git clone https://github.com/intel/intel-extension-for-pytorch

Please let me know if you used the code by connecting with me on Intel's Developer Discord (my user is silvos#5002). Invite link: https://discord.gg/rv2Gp55UJQ


Notices and Disclaimers¶

Important Legal Notices

Performance varies by use, configuration and other factors.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See configuration disclosure for details. No product or component can be absolutely secure.

Intel optimizations, for Intel compilers or other products, may not optimize to the same degree for non-Intel products.

Your costs may vary.

Intel technologies may require enabled hardware, software or service activation.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Altering clock frequency or voltage may void any product warranties and reduce stability, security, performance, and life of the processor and other components. Check with system and component manufacturers for details.

Intel contributes to the development of benchmarks by participating in, sponsoring, and/or contributing technical support to various benchmarking groups, including the BenchmarkXPRT Development Community administered by Principled Technologies.

Read our Benchmarks and Measurements disclosures and our Battery Life disclosures.

Statements on Intel's websites that refer to future plans or expectations are forward-looking statements. These statements are based on current expectations and involve many risks and uncertainties that could cause actual results to differ materially from those expressed or implied in such statements. For more information on the factors that could cause actual results to differ materially, see our most recent earnings release and SEC filings at https://www.intc.com/.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

*Trademarks